Joint audio-visual speech processing for recognition and enhancement

Authors

Gerasimos Potamianos

Chalapathy Neti

Sabine Deligne

Abstract

Visual speech information present in the speaker’s mouth region has long been viewed as a source for improving the robustness and naturalness of human-computer interfaces (HCI). Such information can be particularly crucial in realistic HCI environments, where the acoustic channel is corrupted and, as a result, the performance of traditional automatic speech recognition (ASR) systems falls below usability levels. In this paper, we review two general approaches that utilize visual speech to improve ASR in acoustically challenging environments. The first directly combines features extracted from the acoustic and visual channels, aiming at superior recognition performance of the resulting audio-visual ASR system. The second seeks to eliminate the noise present in the acoustic features through audio-visual enhancement, which in turn improves speech recognition. We present a number of techniques recently introduced in the literature for bimodal ASR and enhancement, and we study their performance using a suitable audio-visual database. Among the methods considered, our recognition experiments demonstrate that decision-based combination of audio and visual features significantly outperforms simpler feature-based integration methods for audio-visual ASR. For audio feature enhancement, a non-linear technique is more successful than a regression-based approach. As expected, bimodal ASR and enhancement outperform their audio-only counterparts.
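To make the distinction between feature-based and decision-based integration concrete, the following is a minimal generic sketch, not the authors' implementation. It assumes that feature-based integration means concatenating the audio and visual feature vectors before classification, and that decision-based integration means combining per-class log-likelihoods from separate audio and visual classifiers with a reliability weight. All function names, class labels, scores, and weights are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def feature_fusion(audio_feat: List[float], visual_feat: List[float]) -> List[float]:
    """Feature-based integration (assumed form): concatenate both feature vectors.

    A single classifier would then be trained on the joint vector.
    """
    return list(audio_feat) + list(visual_feat)


def decision_fusion(
    audio_loglik: Dict[str, float],
    visual_loglik: Dict[str, float],
    audio_weight: float,
) -> Dict[str, float]:
    """Decision-based integration (assumed form): weighted sum of log-likelihoods.

    audio_weight in [0, 1] is a hypothetical reliability weight; a lower value
    trusts the visual stream more, e.g. when the audio is noisy.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return {
        label: audio_weight * audio_loglik[label]
        + (1.0 - audio_weight) * visual_loglik[label]
        for label in audio_loglik
    }


def best_class(scores: Dict[str, float]) -> str:
    """Return the label with the highest combined score."""
    return max(scores, key=scores.get)


# Made-up scores: the (noisy) audio slightly prefers "da", the video prefers "ba".
audio_scores = {"ba": -4.0, "da": -3.5}
visual_scores = {"ba": -1.0, "da": -6.0}

joint_vector = feature_fusion([0.1, 0.2], [0.9])
noisy_decision = best_class(decision_fusion(audio_scores, visual_scores, 0.3))
audio_only_decision = best_class(decision_fusion(audio_scores, visual_scores, 1.0))
```

In this toy setting, lowering the audio weight lets the visual stream override an unreliable audio decision, which is the kind of flexibility that a plain concatenation of features does not offer directly.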


Similar resources

Effective visually-derived Wiener filtering for audio-visual speech processing

This work presents a novel approach to speech enhancement by exploiting the bimodality of speech and the correlation that exists between audio and visual speech features. For speech enhancement, a visually-derived Wiener filter is developed. This obtains clean speech statistics from visual features by modelling their joint density and making a maximum a posteriori estimate of clean audio from v...


Audio-Visual Speech Processing: Progress and Challenges

This keynote focuses on using visual channel information to improve automatic speech processing for human computer interaction. Two main issues are discussed: the extraction and representation of visual speech, as well as its fusion with traditional acoustic information. The talk mostly considers applying these techniques to automatic speech recognition, however additional areas of interest are...


Speech Enhancement and Recognition in Meetings With an Audio-Visual Sensor Array

This paper addresses the problem of distant speech acquisition in multiparty meetings, using multiple microphones and cameras. Microphone array beamforming techniques present a potential alternative to close-talking microphones by providing speech enhancement through spatial filtering. Beamforming techniques, however, rely on knowledge of the speaker location. In this paper, we present an integ...


Introducing the Turbo-Twin-HMM for Audio-Visual Speech Enhancement

Models for automatic speech recognition (ASR) hold detailed information about spectral and spectro-temporal characteristics of clean speech signals. Using these models for speech enhancement is desirable and has been the target of past research efforts. In such model-based speech enhancement systems, a powerful ASR is imperative. To increase the recognition rates especially in low-SNR condition...


Lip-reading from parametric lip contours for audio-visual speech recognition

This paper describes the incorporation of a visual lip tracking and lip-reading algorithm that utilizes the affine-invariant Fourier descriptors from parametric lip contours to improve the audio-visual speech recognition systems. The audio-visual speech recognition system presented here uses parallel hidden Markov models (HMMs), where a joint decision, using an optimal decision rule, is made af...




Publication date: 2003
